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Abstract 

^>- I In practical applications, we often have to deal with high order data, such as a grayscale 

r ■\ . image and a video sequence are intrinsically 2nd-order tensor and 3rd-order tensor, respectively. 

. ' For doing clustering or classification of these high order data, it is a conventional way to vectorize 

Y^ , these data before hand, as PCA or FDA does, which often induce the curse of dimensionality 

problem. For this reason, experts have developed many methods to deal with the tensorial 

data, such as multilinear PCA, multilinear LDA, and so on. In this paper, we still address 

the problem of high order data representation and recognition, and propose to study the result 

^ ' of merging multilinear PCA and multilinear LDA into one scenario, we name it GDA for the 

^ I abbreviation of Generalized Discriminant Analysis. To evaluate GDA, we perform a series of 

i^~ ■ experiments, and the experimental results demonstrate our GDA outperforms a selection of 

L^ ■ competing methods such (2D)^PCA, (2D)^LDA, and MDA. 

cn 

Csj ■ 1 Introduction 



Appearance-based paradigm has been widely employed in the areas of pattern recognition, computer 
vision and signal processing. One primary advantage of appearance-based methods is that it does 
not necessarily create representations or models for the objects since, for a given object, its model 
Cd ■ is implicitly defined by the selection of the samples of the object. When using appearance-based 

methods, we usually represent each sample as a vector by vectorizing the data, in other words, 
we convert an image of size n x m into a vector of nm-dimensional space before hand. Among 
various methods dealing with the vector-represented data, principal component analysis (PCA) and 
linear discriminant analysis (LDA) are the most representative unsupervised and supervised learning 
method, respectively. 

However, when the dimensionality becomes extremely high, PCA and LDA turn to be a time- 
consuming bottleneck. Therefore, some researchers and experts seek multidimensional methods 
that works well in facing with high order data without vectorizing them, such (2D)^PCA [1^ and 
(2D)2LDA Hj. 

(2D)^PCA and (2D)^LDA have been both shown effective in dealing with 2nd-ordcr data, such as 
image classification and face recognition. However, the two methods are limited in 2D data, so when 
facing higher order data as video data, they will no longer obtain better results. Furthermore, He et 
al. propose a method to deal with higher-order data, i.e. tensor subspace analysis (TSA) [4], which 



exploits the label information and solves an optimization problem by minimizing the ratio of the 
within-class scatter and the between-class scatter. Although they say their algorithm can be easily 
extended to higher order tensors, it is not convenient to do so due to the optimization formulation 
they solve. 

Tao et al. propose an alternating projection optimization for supervised tensor learning, which is 
called tensor fisher discriminant analysis, i.e. TFDA [7]. They focuse on the property that when a 
tensor data is multiplied by a vector, its dimension is reduced by one. Even though their framework is 
effective to do high order data recognition when the number of classes is small, it will obtain inferior 
results when facing large number of classes. Yan et al. propose multilinear discriminant analysis 
i.e. MDA [2, and a novel approach called /c-mode optimization to iteratively solve the optimization 
function. MDA has an advantage in doing high order data recognition, whereas due to relative high 
dimension in each mode, it is time-consuming in learning process. Moreover, MDA helps us avoid the 
curse of dimensionality and alleviate the small sample size problem to some extent, however, when the 
training sample size is much smaller than any mode of the sample tensor, their MDA may come across 
the disaster of singular situation caused by the curse of dimensionality. Inspired by their framework, 
we propose to first implement high order SVD (HOSVD) on the tcnsorial data for dimensionality 
reduction, and then use multilinear discriminant analysis to learn the most discriminative subspaces 
of the data. Actually, HOSVD preserves most valuable information, such as the spatial information 
and spatial-temporal relationship in video and holds a capability of smoothing, after which the noises 
in the samples are filtered out to some degree and the dimensions are reduced to a large extent. So 
the running time in learning process becomes much shorter, and as well, the recognition accuracy 
will be higher after further dimensionality reduction by multilinear discriminant analysis, as we can 
see in the experiments. We name the overall process GDA in this paper. 

The rest of the paper is organized as follows. Section [5] explains the notations and gives a brief 
description of the tensor algebra. Our GDA is described in Section [3l as well as some details about 
GDA. Experimental results are presented in Section SI Finally, we conclude our paper in Section [S] 



2 Preliminaries 

Notation for iV-way arrays can be complex, so we first explain the notations used in this paper, then 
we review the fundamental algebra of tensors. 

2.1 Notations 

Throughout this paper, !„, denotes the m x m identity matrix. A high order tensor, a matrix, and 
a vector are denoted by 3C, A and a. And scalars are denoted by lowercase letters, e.g. a. 

As we frequently use the characters i and j in the meaning of indices, / and J will be reserved 
to denote the index upper bounds. The k*'^ elements in a sequence is denoted by a superscript 
in parentheses, e.g. A^'^' , denotes the nth matrix in a sequence. When comes to the subscript in 
parentheses of boldface Euler script letter, such as 3C(j,), one should know it symbolizes a matrix 
flatten along the fc*'' mode. 

In practice, assume there are m training samples, each sample is represented as the iVth-order 
tensors, i.e. {Xj S R^^^-^^^'^^™, i = 1,2, ... ,m}, and OCi belongs to the class indexed as c; € 
{1, 2, . . . , C} where C is the number of different labels or classes. Consequently, the sample set can 
be represented as an {N + l)*''-order sample tensor X G R^ixJ2x-x/jvxm^ 



2.2 Tensor Algebra 

This subsection briefly demonstrates the algebra of tensors which is the fundamental tool in our 
framework. A tensor [5] is a multi-dimensional array. Flattening a tensor is a kind of converting 
process of tensor matricization. Specifically, flattening a tensor along the ith mode, gives a matrix 
X^i) in which the columns are resulted from the tensor by varying the value of index i, while keeping 
the other indices fixed. 

Tensors can be multiplied together, so obviously the notation and symbols for this are much more 
complex than matrices. Here we just list some fundamental multiplications related to our work with- 
out further introduction and proof [5]. The k-mode (matrix) product of a tensor % G ^hxhx-'-xiN 
with a matrix u G R"^^^*" is denoted by CC x fcU and is of size /i x • • • x Ik-i x J x Ik+i x • • ■ x In- 
It can be calculated in terms of the flattened form: 

y = 3CxfcU^y(,)=UX(fc). (1) 

Let X G M^ix-f^x-x-fiv and A^ G R'^'-x/, f^j. ^^ ;, ^ |j^^. . . ^^v}. Then for any k G {1,- • • ,iV}, we 
have 

1^ =3C X iA(i) X 2A(2) X • . • X jvAW 

T (2) 

^ -ytfc) =A(^-)X(fe) ( A(^) »■••«. ac^+i) ® aC^-i) «)•■•«. a(i) ' 

For simplicity, we use |3C; A'^'A'^', . . . , A*^^^]] to denote Equation[51 which can be calculated ^ in 
the matrix sense by: 

^(fc) = A'^'^^k) (a(^) »•••«) aC^+i) ® Af'^-i) «)• • • ® A(i))^. (3) 

Furthermore, the norm of a tensor is defined as: 



l-^ll — v(^i^) — ll^(fc)||-F 

So the distance between tensors OC and y of the same dimensions is defined as dist{%, y) = \\% — y 



\ Z^ Z^ '" Z^ ^iii2...iN- ^ ) 

\ 41=1 42 = 1 ijV = l 



3 Generalized Discriminant Analysis 

In this section, we first review HOSVD. Then we introduce our proposed GDA, followed by fc-mode 
optimization. Finally the full algorithm and classification are presented successively. 

3.1 High Order SVD 

In order to generalize SVD for tensors first we have a look at matrix SVD. A matrix X has two 
vector spaces: a column space and a row space. SVD decomposes X into its two vector spaces as: 
X = USV'^ = SX1UX2V, where U and V represent the orthogonal column space and row space 
respectively. Then we come to SVD for tensors which have A^ associated vector spaces. HOSVD 
decomposes the tensor X G M^ix-x/Af ^^^^ ^-^^ ^ vector spaces by: 

X«-y X iV(i) X2V(2) X---X jvV(^\ (5) 



where V^'^) e W'-'^-'", V^'^'^^VW = Ij, and y e ]^'hy-x.JN ^ Here VW in which k e {1,2,---,A^} 
represents the /c-niode vector spaces, and ^ is the core tensor which shows the interaction between 
different spaces. 

According to Equation ([5]) and ([5]), the A^-mode SVD algorithm [51 [5] for decomposing tensor X is: 

1) For fc = 1,- • • , A^ compute the SVD of aC(fe), V^'') G M^'= ><"''= is the left singular vectors of X(fc), 
where Jk can be chosen less than Ik by a kind of criterion; 

2) The core tensor y is computed by: 

y«XxiV(i)"x2V(2)" x...xjvVW. (6) 

3.2 Multilinear Discriminant Analysis and fc-Mode Optimization 

In order to find the new tensor space that maximize the ratio of the between-class scatter and the 
within-class scatter, we tend to solve the optimization function: 

E n.||[M, - M;uw",u(2r, . . . ,uwif 

U^lf^i = argmax^^ , (7) 

where Mi and M represent the mean of the i*'' class and the global mean of the training data, 
respectively. However, the objective function ([7]) has no closed- form solution due to that the U'-'^-' \^^^ 
depends on each other, so we have to solve ([7]) by an iterative procedure. Having noticed \\X\\ = 
||X(fc)||F and ||Xf = rr(X^X) = Tr(XX^), and if we assume U^i), ..., Uf^^-i), Uf'^+i), ... ,UW 
are already known, then we can calculate U^'^^ by: 

tr(uW"SB(fc)UW) 

U^^') = arg max } ,,,^ — f . (8) 

U tr(UW"S^(fc)UW) 

Denote Up = U(^)^ «) • • •(8)U(''+i)^ ^uC^^i)^ (g)- • -(gjU^^)^, then Ss(fc) a.ndSw{k) are between-class 
and the within-class scatter matrix along the fc*'' mode respectively: 

i=l 
C 

The optimal projected subspace along mode-fc is spanned by the columns of U^*^' , which is the 
solution of equation ([5]). It is easy to explicitly solve the singular problem of S^/^,NSB(fe) to obtain 

XjC^). The iterative procedure to solved ([7]) is called fc-mode optimization, which is first put forward 
by Yan et al. [5]. We continue to use this term for its conciseness. 

3.3 Generalized Discriminant Analysis — Algorithmic Analysis 

If we directly iteratively solve U^'^' , the learning process will be time-consuming and we may meet 
the curse of dimensionality. Like Fisherface [5] , we first do the dimensionality reduction of the data 



Algorithm 1 Generalized Fisher Discriminant Analysis 



Input: The training set X e R/ix/sx-x/jvxm^ -^j^^j^. ^j^^gg labels a G {1,2,...,C}, where i = 
1, 2, . . . , ?Ti, and the final lower dimensions Ii x I2 x ■ ■ ■ x /^. 

Output: the projectors (V^'^'^U^''^) £ K''^''^^'', where I'f. < Ik and k G 
{1,2,..., A}. 
1: Use high order SVD to decompose the training set X as: 

±«-9xiV(i) x-.-XjvVW x^+il™ 

where ^ £ K./ix,/2x-x,/„xm ^j-^^^ yC^) G M^'--X'''= 
Initialize U*'') G R'''= ^^^ , where fc G {1, 2, . . . , A}; 

Calculate the projected mean of each class Mi and the projected global mean M; 
while stop criterion is not reached do 
for fc = 1,2,..., A do 



c 
SB(fc) =^". ((M, - M)(fc)) UJUp ((M, - M)(fc) ^ 



c 



i=l jdCi 

Solve the optimization problem through generalized eigenproblem: 

tr(uW^*+i^SB(fc)U('^'^*+i 



^(fc),t+i _ argmax 



u(^) tr(UW.*+i^SvF(fc)UW>*+i) 



Yj{k)d+i jg ^]^g j^ ^Qp eigenvectors of S^\j,sS5(fe). 
8: end for 
9: end while 



set via HOSVD. In this stage, we choose a threshold to achieve dimensionality reduction purpose: 

d 

^>^ (9) 

i=l 

where cti,- • -^ad is the d largest singular values of 3C(j.). However HOSVD does not exactly mean 
deleting the rest 1 — 9 features, instead it smoothes the samples by filtering out random noises to 
some extent, as Fig[3] shows the smoothing result from the silhouettes of a video clip. 

After HOSVD. we come to solve Equation (O. The lower dimension enables us to speed up the 
training process and avoid the singular situation. The whole algorithm are displayed Algorithm [TJ 
in which we use the superscript t to denote the resulting U^'^^ of the t*'* iteration. 



Figure 1: 10 samples in the ORL face database. 

3.4 Classification With GDA 

With the learned projectors {V^*"''!^]^} and {U*-'"-'!^]^}, the low-dimensional representation of 

the training sample "Xi, i = 1,2, ... ,m, can be computed as Z.j = Xi Xi (V'^^'U^^') x ■•• Xjv 

(V'-^-'U^^-') . When a new data % comes, wc first compute its low-dimensional representation as 

Z = [X; (V(i)u(i))^ , . . . , (V(^)U(^))'^]. Then its class label is predicted to be that of the sample 
whose low-dimensional representation is nearest to Z>, that is c^. where 

^* = argmin 112,-2,, II (10) 

i 

In this paper, wc use this nearest-neighbor method for the final classification throughout all the 
experiments owing to its simplicity in computation. 



4 Experimental Results 

In this section, wc conduct a series of experiments to consider the performance of our proposed GDA 
in dimensionality reduction, clustering and recognition. All of our experiments are carried out on a 
PC machine with Pentium(R) Dual-Core CPU and 4. GOG memory. 

4.1 Data Preparation 

Two benchmark databases, ORL [1] and Weizmann [3] are used in our experiments. ORL database 
contains 400 images of 40 individuals and each image is grayscale and normalized to the resolution of 
112*92 pixels. In ORL database, the images are taken at different times, varying the lighting, facial 
expressions (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses). All 
the images are taken against a dark homogeneous background with the subjects in an upright, frontal 
position (with tolerance for some side movement). Fig [T] illustrates 10 images of one individual from 
ORL database. 

Weizmann database is a recent database with a reasonable size reported in [3]. It contains ten 
action classes performed by nine individuals. The actions include bending (bend), jumping jack 
(jack), jumping-forward-on- two- legs (jump), jumping- in-place-on-two-legs (pjump), running (run), 
galloping sideways (side), skipping (skip), walking (walk), waving-one-hand (wavel), and waving- 
two-hands (wave2). Hence, wc have 90 video sequences in all. For action recognition experiment, we 
directly use the binary silhouettes of Weizmann database. Some of these silhouettes are deformed 
and noisy due to segmentation problems, but be still contained in the training set. In Weizmann 
database, each action video generally includes 2-4 complete action cycles. Using a single period is 
much more computationally efficient than using the entire length of the video. Also we need to have 
equal-length sequences in our tensor framework. So wc find the period for all the training action 
sequences such as 10 frames, which may be smaller than maximum period, but this problem can be 
solved by deleting the extra frames randomly. 



4.2 Dimensionality Reduction and Smoothing 

In this subsection, we conduct an experiment to show HOPCA (HOSVD), which is a part of our 
proposed GDA, can do better in dimensionahty reduction, and preserve spatial information of images 
and spatiotemporal relationship of videos. Furthermore, HOPCA can filter out the random noises 
in some sense. 

Essentially, (2D)2pCA QU] is 2nd-order PCA, so in this case, HOPCA equals to (2D)2pCA. In 
this experiment, we need to have a look at CR and PSNR, which denote compression ratio and 
peak signal-to-noise ratio, respectively. CR is defined as the ratio of the compressed size to the 
uncompressed size, and the formula for calculating PSNR is: 

PSNR ^20log^oi- 



\/ \\-^7ioised -^originalWp / '^^^ 

where X„oised and "^original are both matrices of the size m x n. Suppose there are M training 
face images with size m x n, the number of projection vectors in PCA and HOPCA is p, d and 
q. Then the compression ratios of PCA and HOPCA are computed as Mmn/{Mp + mnp) and 
Mmn/{Mdq -\- md + nq) respectively. 

We restore the images from the dimensionality reduced dataset. Fig. [2] shows some reconstruc- 
tion results under similar compression ratios. It is obvious that HOPCA preserves more inherent 
characteristics. 

Furthermore, we conduct experiments to see the abilities of PCA and HOPCA in representing action 
videos under similar compression ratios. We randomly select a period of video sequence in Wcizmann 
database, and for simplicity, we use the binary silhouette of the sequence here. The comparison shows 
HOPCA preserves spatial-temporal relationships when dealing with video compression, whereas PCA 
destroys the important information, as Fig. [3] shows. 

From this experiment, we can easily see HOSVD not only accomplishes dimensionality reduction 
better than that of PCA, but also preserves more valuable intrinsic structures such spatial and 
spatial-temporal relationships, which PCA will destroy. Therefore high order SVD can filter the 
noises out in some sense, which will boost the computation and enhance the performance in the 
recognition process. 

4.3 Visualization of Dataset — Clustering and Classification 

In this subsection, we use PCA, (2D)^PCA, MDA and GDA to project the images into a 2- 
dimensional subspace for visualization. This experiment helps us understand that our proposed 
GDA can obtain more discriminating power than the other 3 methods under a relative lower dimen- 
sional subspace. 

We select the first 5 individuals in ORL for this test and Fig. |4] shows the results. For PCA, we 
select the two eigenvectors corresponding to the first two largest eigenvalues of the covariance matrix, 
projecting the images into 2-dimensional subspace. And for (2D)^PCA, MDA and GDA, we project 
the images into either R^^^ and R^^^, both of which are 2-dimensional spaces. R^^^ is formed by 
the projection u^X[vi,V2] and R^^^ is formed by the projection [ui, U2]"^Xvi, here X stands for a 
image, and u and v are the columns of projectors. 

As can be seen from Fig.lU PCA performs the worst, it fails to distinguish the different classes from 
a clustering viewpoint. In contrary, (2D)^PCA clusters the data better than PCA, even if it also 
mixes some classes together. MDA, as a supervised method, does significantly well than PCA and 
(2D)^PCA, which can be seen from (d) and (c). As well, MDA also mixes some categories together. 



CR: 0.0800 CR: 0.1200 CR: 0.3001 CR: 0.5402 
P SNR:20. 51 P SNR:22. 79 PSNR:27.26 PSNR:29.09 

^.3 a 3 [1 
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PSNR:27.63 PSNR:28.59 PSNR:35.31 PSNR:40.00 
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Figure 2: Reconstructed image from ORL database by PCA and HOPCA. The original image is 
shown on the leftmost. The first row shows a series image of restoration from PCA reduction, and 
the second row displays that from HOPCA reduction. 

seen in Fig.|3](d). However, one fatal drawback of MDA is the between-class distance is much smaller 
than that of GDA, it may lead to inferior recognition, in the sense of overfitting. Clearly our GDA 
performs the best, it maintains a good distance between each classes, and also keeps the a good 
within-class aggregation. This illustrative example shows that GDA can have more discriminating 
power than others under a relatively lower dimension. 

4.4 Classification Result on ORL Database 

This subsection shows a set of experiments to test the performance of our GDA in face recognition, 
in which each image is represented as a 2nd-order tensor. 

Three sets of experiments were conducted to compare the face recognition performance of GDA with 
PCA, (2D)^PCA, (2D)^LDA and MDA. For ease of representation, the experiments are named as 
Trainm/Testn which means that m images of per person are randomly selected for training and the 
remaining n images for testing. 

In order to fairly evaluate the effectiveness of our GDA, we average the recognition accuracies by 
multiple iterations. Table [1] shows the average face recognition accuracies of all the algorithms in 
our experiments. The comparative results show our GDA outperforms the other four methods on 
the three sets of experiment, especially in the cases with a small number of training samples. Fig. [5] 
demonstrates the accuracies vs. the dimensionality of the four methods on ORL. 

Fig.[5](A) shows the recognition accuracies of (2D)^PCA, (2D)^LDA, MDA and GDA versus numbers 
of features along the row and column directions respectively on ORL database, here the training 
number and test number are both 5. 



4.5 Results on Weizmann Database 

In this subsection, we choose a higher order dataset, Weizmann database[3], to check the quality of 
our GDA in action recognition. 
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Figure 3: One reconstructed training video sequence on Weizmann database under similar com- 
pression ratio by PCA and HOPCA. Here Q is set as 0.95. It is obvious that PCA destroys the 
spatial-temporal information in the video, causing severe distortion, whereas HOPCA preserves the 
space-time relationship. 

Table 1: Recognition accuracy (%) comparison of our proposed GDA with other methods on ORL 
database 





Train5/Test5 


Train4/Test6 


Train3/Test7 


PCA 


90.85 


87.92 


83.25 


(2D)2pCA 


94.70 


92.58 


90.36 


(2D)2lDA 


94.80 


93.46 


89.68 


MDA 


96.50 


93.42 


83.30 


GDA 


97.10 


95.75 


92.82 



To compare GDA with other methods fairly, we compute the recognition accuracy using the leave- 
one-out method. Each time, we first leave out all the sequences pertaining to one person. Then we 
train using all the remaining sequences (80 sequences), and we use the 10 actions of the omitted 
person as test actions. We average the results from all the persons. 

Table [2] shows the recognition accuracies, where dimension, compression ratio, running time and 
iteration are acquired when the best accuracy achieves. We can see from the comparative results 
that our proposed GDA performs the best among all the algorithms. Even though Eigenface and 
Fisherface use less running time, their recognition accuracies are far less than that of MDA and GDA. 
However, MDA directly deal with the high-dimensional data, so the running time becomes much 
longer. Moreover, the compression ratio acquired in MDA also suffers from the high dimensionality, 
therefore it achieves no better results than that of GDA. 



5 Conclusion 



In this paper, we develop Generalized Discriminant Analysis (GDA). GDA provides a more natural 
representation for images, videos and other high order data, avoiding any models. By analysis, we 
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Figure 4: 2D visualization of five classes in ORL dataset. The HOSVD threshold (EquationlHl) in 
GDA is set 0.994 
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Figure 5: Comparisons of recognition accuracies on between MDA, PCA, (2D)^PCA, (2D)^LDA 
and GDA on ORL database. 

Table 2: Recognition accuracy (%) comparison of our proposed GDA with other methods on 
Weizmann database. In HOPCA, the first part of GDA, we set threshold 6 = 0.98. (CR stands for 
compression ratio.) 

Accuracy Dimension Compression Ratio Running Time (s) 



Eignface 


84.4 


11 


0.1226 


0.1198 


Fisherface 


95.6 


8 


0.8894 


0.1129 


MDA 


98.89 


7x7x3 


0.0056 


465.34 


GDA 


98.89 


6x3x3 


0.0028 


2.67 



show GDA enables us to avert the curse of dimensionality and it preserves the spatial and spatial- 
temporal relationship of the data. Through experiments, we see that GDA can alleviate the small 
sample size problem and shows high efficiency and effectiveness of computation. 
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